Ceyhun Onur

01/01/2020

Analyzing Bet Data

In this notebook we tried to analyze the bet odds and their implied probabilities. This is our first approach to predict outcomes of EPL matches. We thought that bet odds are reliable since bet companies constantly analyze every aspect of teams, including newest events and statuses of players and all related variables. So they probably perform different kinds of ML and statistical approaches to determine their odds. So we decided to analyze their odds and try to find outcome probabilities based on their probability. We tried to find minimum RPS among the bet companies. Since the first rounds of submissions were early, we were not ready to construct our ML models. In the meanwhile we did not want to fall behind in competition, so decided to use this approach for our initial submissions. Also we kept analyzing probabilities with this approach to make a backup in case of having ā€œbadā€ outcomes from ML models.

library(data.table)

Read Matches

We read finished matches from the data (we update data just before every submission round). We also try to determine the outcome as -1,0,1 corresponding to away win, draw, home win by looking match scores.

# Read frome the snapshot, select only columns we're interested in.
matches <- fread(paste0(getwd(), "/data/matches.csv"), select=c("match_id", "match_hometeam_name", "match_awayteam_name", "league_id","match_status","match_hometeam_score", "match_awayteam_score", "epoch"))
# get only finished english premier league matches
finished_epl_matches <- matches[league_id == 148 & match_status == "Finished"]
finished_epl_matches[, outcome := (ifelse(match_hometeam_score == match_awayteam_score, 0 , ifelse(match_hometeam_score > match_awayteam_score, 1 , -1)))]
finished_epl_matches

Read bets

We read bet data (again we update them just before round submission). We also have a ā€œbet_extraā€ datatable which is something we tried to construct from online data. Some matches’ bet data were missing in given csv file(on google drive). So we had to find these bet data on our own. We used www.betexplorer.com to gather missing data. Each missing match we gathered data from the webpage and put them into our bet_extra.csv file. Then we merge two files ā€œbets.csvā€, ā€œbets_extra.csvā€.

# read bets csv
bets <- fread(paste0(getwd(), "/data/bets.csv"))
# add column to indicate if match ended as draw
# get only bets data with win, draw or lose.
bets <- bets[variable %in% c("odd_1", "odd_x", "odd_2" )]
bets_extra <- fread(paste0(getwd(), "/data/bets_extra.csv"))
names(bets_extra)[1] <- "match_id"
names(bets_extra)[2] <- "odd_bookmakers"
names(bets_extra)[3] <- "odd_1"
names(bets_extra)[4] <- "odd_x"
names(bets_extra)[5] <- "odd_2"
modified_time = file.info(paste0(getwd(), "/data/bets.csv"))$mtime 
bets_extra = melt(bets_extra, id.vars = c("match_id", "odd_bookmakers"),
                measure.vars = c("odd_1", "odd_x", "odd_2"))
bets_extra[, odd_epoch := as.numeric(modified_time)]
bets <- rbind(bets, bets_extra)
bets = bets[order(match_id)]
bets

We merge bet and finished epl matches.

# join bets and match data by match_id
merged_data <- merge(finished_epl_matches, bets, by='match_id')
merged_data

Custom functions

We created 3 functions to find RPS on table. RPS_single was for test purposes to, we actually don’t use it. We use RPS_matrix in this anaylsis. It takes probabilities(away,draw,win) and their corresponding outcome set. Then finds an RPS value. RPS_model is constructed for our ML model.

RPS_single<- function(probs,outcomes){
  probs = cumsum(probs)
  outcomes = cumsum(outcomes)
  RPS = sum((probs-outcomes )^2) / (length(probs)-1)
  return(RPS)
}

RPS_matrix<- function(probs,outcomes){
  probs=as.matrix(probs)
  outcomes=as.matrix(outcomes)
  probs=t(apply(t(probs), 2, cumsum))
  outcomes=t(apply(t(outcomes), 2, cumsum))
  RPS = apply((probs-outcomes)^2,1,sum) / (ncol(probs)-1)
  return(RPS)
}

RPS_model<- function(home,draw,away,outcome){
  probs = cbind(home,draw,away)
  outcome = as.integer(outcome)
  outcome_h = as.integer(outcome == 1)
  outcome_d = as.integer(outcome == 2)
  outcome_a = as.integer(outcome == 3)
  outcomes = cbind(outcome_h, outcome_d,outcome_a)
  return(RPS_matrix(probs,outcomes))
}

bookmaker_analysis takes bookmaker bet data and converts them to implied probabilities.

# Use this function to analyze for different bookmakers.
bookmaker_analysis <- function(bookmaker_name, data) {
  # get the data with selected bookmaker
  if(bookmaker_name == TRUE){
    bookmaker_bets <- data
  }
  else{
    bookmaker_bets <- data[odd_bookmakers == bookmaker_name]
  }
  # find probabilities from given odds
  bookmaker_bets <- bookmaker_bets[, c("p_win", "p_draw", "p_lose") := list(1/value[variable == 'odd_1'], 1/value[variable == 'odd_x'], 1/value[variable == 'odd_2']), by = list(match_id, odd_bookmakers, variable)]
  # group data by match_id and odd_bookmakers, so that we can have only one row for each match and bookmaker
  bookmaker_bets <- aggregate(bookmaker_bets[,c("p_win","p_draw","p_lose")], by=list(bookmaker_bets$match_id, bookmaker_bets$odd_bookmakers), FUN = function (x) first(na.omit(x)))
  # convert it back to data table
  bookmaker_bets <- data.table(bookmaker_bets)
  # rename grouped columns
  names(bookmaker_bets)[1] <- "match_id"
  names(bookmaker_bets)[2] <- "odd_bookmakers"
  # convert probabilities by implied probability function for each probability type
  bookmaker_bets[, p_win_imp:=p_win / (p_win + p_draw + p_lose)]
  bookmaker_bets[, p_draw_imp:=p_draw / (p_win + p_draw + p_lose)]
  bookmaker_bets[, p_lose_imp:=p_lose / (p_win + p_draw + p_lose)]
}

Past Match Predictions

We try to analyze past matches with their RPS values. There can be many matches with same ID belongs to different bookmaker, since each bookmaker gives different implied probability to each matches.

past_match_predictions <- bookmaker_analysis(TRUE, merged_data)
past_match_predictions <- past_match_predictions[, -c("p_win","p_draw","p_lose")]
setDT(past_match_predictions, key = "match_id")[finished_epl_matches, outcome := i.outcome]
setDT(past_match_predictions, key = "match_id")[finished_epl_matches, epoch := i.epoch]
past_match_predictions

This one creates a different matrix for outcomes in a form that our RPS functions expect. Then we find RPS value of each finished match according to their outcomes and their implied probability.

past_match_predictions[,pred_id:=1:.N]
outcome_for_rps=data.table::dcast(past_match_predictions,pred_id~outcome,value.var='pred_id')
outcome_for_rps[,pred_id:=NULL]
outcome_for_rps[is.na(outcome_for_rps)]=0
outcome_for_rps[outcome_for_rps>0]=1

overall_results <- past_match_predictions[,rps := RPS_matrix(past_match_predictions[,list(p_lose_imp,p_draw_imp,p_win_imp)],outcome_for_rps)]
overall_results[,pred_id:=NULL]
overall_results <- data.table(overall_results)
overall_results

Here we find each bookmaker’s average RPS. We tried to construct a ā€œleaderboardā€ for bookmakers where they’re sorted by their mean RPS values. So we can select best bookmakers from this table. There are some bookmakers did really good job in terms of RPS but we found that they did not predict many matches. So we did not manually select bookmaker names from this table.

bookmaker_rps <- aggregate(overall_results[, rps], list(overall_results$odd_bookmakers), mean)

names(bookmaker_rps)[1] <- "odd_bookmakers"
names(bookmaker_rps)[2] <- "rps_mean"
bookmaker_rps <- data.table(bookmaker_rps)
bookmaker_rps <- bookmaker_rps[order(rps_mean)]
bookmaker_rps
mean(bookmaker_rps$rps_mean)
## [1] 0.2276873

Selecting ā€œthe bestā€ bookmaker would not be the wisest approach since depending only one bookmaker is not a great opinion. Instead we tried to take average of ā€œbest n bookmakerā€ implied probabilities. In order to do that we had to find this ā€œnā€. At the beginning we manually set this n as 3 (n=3). However then we realized that we can take it further and find this n from the data. We ran this code once to find optimal n. As our findings, having n=5 is the optimal choice. Since this code breaks the other parts and we did run it only once, we commented it out.

# Don't run this with other codes. This is a test code for to find optimal n (5 currently).
for (n in seq(0,60,1)) {
  setDT(past_match_predictions, key = "odd_bookmakers")[bookmaker_rps, bookmaker_rps_ave := i.rps_mean]
  first_n_bookmakers = past_match_predictions[order(match_id, bookmaker_rps_ave)]
  first_n_bookmakers = first_n_bookmakers[, .SD[1:n], by=match_id]
  submission = first_n_bookmakers[, lapply(.SD, mean), .SDcols = average_cols, by=match_id]
  past_match_predictions = past_match_predictions[, .SD, by=match_id]
  setDT(submission, key = "match_id")[past_match_predictions, outcome := i.outcome]
  submission[,pred_id:=1:.N]
  outcome_for_rps=data.table::dcast(submission,pred_id~outcome,value.var='pred_id')
  outcome_for_rps[,pred_id:=NULL]
  outcome_for_rps[is.na(outcome_for_rps)]=0
  outcome_for_rps[outcome_for_rps>0]=1
  
  overall_results <- submission[,rps := RPS_matrix(submission[,list(p_lose_imp,p_draw_imp,p_win_imp)],outcome_for_rps)]
  overall_results[,pred_id:=NULL]
  overall_results <- data.table(overall_results)
  overall_results = overall_results[c(-1,-2)]
  cat("n= ",n,"mean= ",mean(overall_results$rps, na.rm=TRUE),"\n")
}

Predicting upcoming matches

We find upcoming matches by looking at their epochs.

unplayed_epl_matches <- matches[league_id == 148 & is.na(match_status) & epoch < as.integer(as.POSIXct("2019-12-30"))] 
unplayed_epl_matches <- unplayed_epl_matches[,-c("match_status", "match_hometeam_score", "match_awayteam_score", "league_id")]
unplayed_epl_matches

We also convert bet odds of upcoming matches to implied probabilities. We also put bookmaker’s mean RPS value in the table.

unplayed_merged_data <- merge(unplayed_epl_matches, bets, by='match_id')
unplayed_predictions <- bookmaker_analysis(TRUE, unplayed_merged_data)
unplayed_predictions <- unplayed_predictions[, -c("p_win","p_draw","p_lose")]

unplayed_predictions <-  merge(unplayed_epl_matches, unplayed_predictions, by='match_id')
setDT(unplayed_predictions, key = "odd_bookmakers")[bookmaker_rps, bookmaker_rps_ave := i.rps_mean]
unplayed_predictions

We take average of implied probabilities just to see them.

average_cols = c("p_win_imp","p_draw_imp","p_lose_imp")
average_predictions = unplayed_predictions[, lapply(.SD, mean), .SDcols = average_cols, by=match_id]
average_predictions = merge(unplayed_epl_matches, average_predictions, by='match_id')
average_predictions

Here we find best ā€œnā€ bookmakers and take average of their implied probabilities. Concatinate them to form our submission text.

n = 5
first_n_bookmakers = unplayed_predictions[order(match_id, bookmaker_rps_ave)]
first_n_bookmakers = first_n_bookmakers[, .SD[1:n], by=match_id]
submission = first_n_bookmakers[, lapply(.SD, mean), .SDcols = average_cols, by=match_id]
submission_txt = paste(t(submission[, c("match_id", "p_win_imp", "p_draw_imp", "p_lose_imp")]), collapse=',')
submission_txt
## [1] "273291,0.320750168385935,0.255694206555901,0.423555625058164,273292,0.514999959326059,0.256387278343203,0.228612762330738,273293,0.245522815911084,0.265106226946046,0.48937095714287,273294,0.724441681576651,0.178925550853189,0.0966327675701606,273295,0.813337601450492,0.125833146727795,0.0608292518217134,273296,0.294848365795222,0.283994975257029,0.421156658947749,273297,0.181391106086424,0.221807943079548,0.596800950834028,273298,0.474215762202999,0.275686182129787,0.250098055667214,273299,0.485795365926461,0.251914734354526,0.262289899719013,273300,0.376835456695125,0.267121062849755,0.35604348045512"

We can find our submission RPS value by providing outcome set.

submission$outcome = c(-1, -1, 1, 1,0,0,0,1,1,0)
submission[,pred_id:=1:.N]
outcome_for_rps=data.table::dcast(submission,pred_id~outcome,value.var='pred_id')
outcome_for_rps[,pred_id:=NULL]
outcome_for_rps[is.na(outcome_for_rps)]=0
outcome_for_rps[outcome_for_rps>0]=1

overall_results <- submission[,rps := RPS_matrix(submission[,list(p_lose_imp,p_draw_imp,p_win_imp)],outcome_for_rps)]
overall_results[,pred_id:=NULL]
overall_results <- data.table(overall_results)
overall_results